{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting the value of states in a frozen lake environment\n",
    "We learned that in the prediction method, the policy is given as an input and we predict\n",
    "value function using the given policy. So let's initialize a random policy and predict the\n",
    "value function (state values) of the frozen lake environment using the random policy.\n",
    "\n",
    "First, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we create the frozen lake environment using gym:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('FrozenLake-v0')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "source": [
    "Define the random policy which returns the random action by sampling from the action\n",
    "space:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def random_policy():\n",
    "    return env.action_space.sample()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's define the dictionary for storing the value of states and we initialize the value of all the\n",
    "states to 0.0:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "V = {}\n",
    "for s in range(env.observation_space.n):\n",
    "    V[s] = 0.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the discount factor $\\gamma$ and the learning rate $\\alpha$: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "alpha = 0.85\n",
    "gamma = 0.90"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the number of episodes and number of time steps in the episode:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_episodes = 5000\n",
    "num_timesteps = 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing the value of states\n",
    "Now, let's compute the value function (state values) using the given random policy as:\n",
    "\n",
    "$$V(s) = V(s) + \\alpha (r + \\gamma V(s') - V(s)) $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "#for each episode\n",
    "for i in range(num_episodes):\n",
    "    \n",
    "    #initialize the state by resetting the environment\n",
    "    s = env.reset()\n",
    "    \n",
    "    #for every step in the episode\n",
    "    for t in range(num_timesteps):\n",
    "        \n",
    "        #select an action according to random policy\n",
    "        a = random_policy()\n",
    "        \n",
    "        #perform the selected action and store the next state information\n",
    "        s_, r, done, _ = env.step(a)\n",
    "        \n",
    "        #compute the value of the state\n",
    "        V[s] += alpha * (r + gamma * V[s_]-V[s])\n",
    "        \n",
    "        #update next state to the current state\n",
    "        s = s_\n",
    "        \n",
    "        #if the current state is the terminal state then break\n",
    "        if done:\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After all the iterations, we will have a value of all the states according to the given random\n",
    "policy. \n",
    "\n",
    "## Evaluating the value of states \n",
    "\n",
    "Now, let's evaluate our value function (state values). First, let's convert our value dictionary\n",
    "to a pandas data frame for more clarity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(list(V.items()), columns=['state', 'value'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before checking the value of the states, let's recollect that in the gym all the states in the\n",
    "frozen lake environment will be encoded into numbers. Since we have 16 states, all the\n",
    "states will be encoded into numbers from 0 to 15 as shown below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "![title](Images/1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, Let's check the value of the states:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.004119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0.000092</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0.000428</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0.000088</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>0.001522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>0.000089</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>0.000141</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>9</td>\n",
       "      <td>0.526370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>0.009634</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>13</td>\n",
       "      <td>13</td>\n",
       "      <td>0.273292</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "      <td>0.081485</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>15</td>\n",
       "      <td>15</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    state     value\n",
       "0       0  0.004119\n",
       "1       1  0.000092\n",
       "2       2  0.000428\n",
       "3       3  0.000088\n",
       "4       4  0.001522\n",
       "5       5  0.000000\n",
       "6       6  0.000089\n",
       "7       7  0.000000\n",
       "8       8  0.000141\n",
       "9       9  0.526370\n",
       "10     10  0.009634\n",
       "11     11  0.000000\n",
       "12     12  0.000000\n",
       "13     13  0.273292\n",
       "14     14  0.081485\n",
       "15     15  0.000000"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "As we can observe, now we have the value of all the states and also we can notice that\n",
    "the value of all the terminal states (hole states and goal state) is zero.\n",
    "\n",
    "Now that we have understood how TD learning can be used for the prediction task, in the\n",
    "next section, we will learn how to use TD learning for the control task. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
